An analysis of the Covid-19 Pandemic

Marco Huang (0201), Jingyun Li (0101) Medical-Illustration-2.jpeg

Introduction

COVID-19 is the disease caused by SARS-CoV-2, the coronavirus that emerged in December 2019. COVID-19 can be severe, and has caused millions of deaths around the world as well as lasting health problems in some who have survived the illness. The coronavirus can be spread from person to person. It is diagnosed with a test.

Three years after the break out of the corona virus, the growth of the COVID confirmation rate seems to slow down, providing us with the best timing to examine the pandemic as a whole. Here in part one, we would like to look at COVID in the United States, and several representative states in specific, and discuss what the data illustrates to us. For part two, we're going to look at the relation between confirmation cases and housing prices.

Part 1: About COVID-19

1. Data Collection

In this project the Covid-19 data we used comes from Johns Hopkins University and is available at this link: https://github.com/CSSEGISandData/COVID-19

1.1 Tool used

We used the following tools to collect this data: pandas, numpy, matplotlib, scikit-learn, seaborn, os, folium, and more.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import warnings
import os
import scipy.stats as stats
from statsmodels.formula.api import ols as o
from sklearn import linear_model
import re
warnings.filterwarnings('ignore')

1.2 Data processing

1.2.2 US overall

We want to first look at the overall confirmed and death cases in the US. Here we read the data of the confirmed covid cases throughout the whole world from 1/22/20 till today. For this project, We would focus on the united states.

Below is the global confirmation data. It includes all the countries, their latitude, longitude, and the daily cumulative confirmation among all these countries.

In [ ]:
world_conf = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv", sep=',')
world_conf.head()
Out[ ]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 12/6/22 12/7/22 12/8/22 12/9/22 12/10/22 12/11/22 12/12/22 12/13/22 12/14/22 12/15/22
0 NaN Afghanistan 33.93911 67.709953 0 0 0 0 0 0 ... 206331 206414 206465 206504 206543 206603 206702 206743 206788 206879
1 NaN Albania 41.15330 20.168300 0 0 0 0 0 0 ... 333455 333472 333490 333491 333521 333533 333535 333567 333591 333613
2 NaN Algeria 28.03390 1.659600 0 0 0 0 0 0 ... 271122 271128 271135 271140 271146 271146 271147 271149 271156 271156
3 NaN Andorra 42.50630 1.521800 0 0 0 0 0 0 ... 47219 47446 47446 47446 47446 47446 47446 47446 47606 47606
4 NaN Angola -11.20270 17.873900 0 0 0 0 0 0 ... 104750 104808 104808 104808 104808 104808 104808 104808 104946 104946

5 rows × 1063 columns

We first extract the confirmation data for the US from the world data frame. We then calculated the increase of confirmation covid for every day in between and transposed it afterward.

In [ ]:
us_conf = pd.melt(world_conf, ['Province/State','Country/Region', 'Lat', 'Long'], var_name="Date", value_name='conf_cases')
us_conf = us_conf.drop(columns=['Province/State', 'Lat', 'Long'])
us_conf = us_conf.rename(columns={'Country/Region': 'Country'})
us_conf["Date"] = pd.to_datetime(us_conf['Date'])
us_conf = us_conf.groupby(['Country', 'Date']).sum()
us_conf["Next_day"] = us_conf['conf_cases'].shift(fill_value=0)
us_conf["conf_change"]= us_conf['conf_cases'] - us_conf['Next_day']
us_conf = us_conf.drop(columns=['Next_day'])
us_conf = us_conf.reset_index()
us_conf = us_conf[us_conf["conf_change"] >= 0] 
us_conf = us_conf[us_conf["Country"]=="US"]
us_conf = us_conf.set_index("Date")
us_conf = us_conf.drop(columns=['Country'])
us_conf.head()
Out[ ]:
conf_cases conf_change
Date
2020-01-23 1 0
2020-01-24 2 1
2020-01-25 2 0
2020-01-26 5 3
2020-01-27 5 0

Below is the global death data. It includes all the countries, their latitude, longitude, and the daily cumulative death among all these countries.

In [ ]:
world_death = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv", sep=',')
world_death.head()
Out[ ]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 12/6/22 12/7/22 12/8/22 12/9/22 12/10/22 12/11/22 12/12/22 12/13/22 12/14/22 12/15/22
0 NaN Afghanistan 33.93911 67.709953 0 0 0 0 0 0 ... 7835 7837 7839 7839 7839 7839 7840 7843 7843 7843
1 NaN Albania 41.15330 20.168300 0 0 0 0 0 0 ... 3594 3594 3594 3594 3594 3594 3594 3594 3594 3594
2 NaN Algeria 28.03390 1.659600 0 0 0 0 0 0 ... 6881 6881 6881 6881 6881 6881 6881 6881 6881 6881
3 NaN Andorra 42.50630 1.521800 0 0 0 0 0 0 ... 157 158 158 158 158 158 158 158 158 158
4 NaN Angola -11.20270 17.873900 0 0 0 0 0 0 ... 1925 1925 1925 1925 1925 1925 1925 1925 1928 1928

5 rows × 1063 columns

We did the same thing for the death data -- we calculated the increase of death cases for every day in between and transposed it afterward.

In [ ]:
us_death = pd.melt(world_death, ['Province/State','Country/Region', 'Lat', 'Long'], var_name="Date", value_name='death_cases')
us_death = us_death.drop(columns=['Province/State', 'Lat', 'Long'])
us_death = us_death.rename(columns={'Country/Region': 'Country'})
us_death["Date"] = pd.to_datetime(us_death['Date'])
us_death = us_death.groupby(['Country', 'Date']).sum()
us_death["Next_day"] = us_death['death_cases'].shift(fill_value=0)
us_death["death_change"]= us_death['death_cases'] - us_death['Next_day']
us_death = us_death.drop(columns=['Next_day'])
us_death = us_death.reset_index()
us_death = us_death[us_death["death_change"] >= 0] 
us_death = us_death[us_death["Country"]=="US"]
us_death = us_death.set_index("Date")
us_death = us_death.drop(columns=['Country'])
us_death.head()
Out[ ]:
death_cases death_change
Date
2020-01-22 0 0
2020-01-23 0 0
2020-01-24 0 0
2020-01-25 0 0
2020-01-26 0 0

We then joined the tables into a data frame us_overall. The new data frame has confirmed cases, daily confirmed the change, death cases, and daily death change data all in one.

In [ ]:
us_overall = us_conf.join(us_death, how='outer')
us_overall.head()
Out[ ]:
conf_cases conf_change death_cases death_change
Date
2020-01-22 NaN NaN 0.0 0.0
2020-01-23 1.0 0.0 0.0 0.0
2020-01-24 2.0 1.0 0.0 0.0
2020-01-25 2.0 0.0 0.0 0.0
2020-01-26 5.0 3.0 0.0 0.0

1.2.2 US states

Below is the confirmed cases in each state of the United States. Here we would also want to sort out some states that is representative of a certain area. Below is the states we picked for this project, we selected one state for each of the 9 regions.

  • New England: Maine
  • Middle Atlantic: New York
  • East North Central: Wisconsin
  • West North Central: Kansas
  • South Atlantic: Maryland
  • East South Central: Alabama
  • West South Central: Texas
  • Mountain: Arizona
  • Pacific: California

We first read in the data from the Hopkins site.

In [ ]:
conf = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv", sep=',')
conf.head()
Out[ ]:
UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ ... 12/6/22 12/7/22 12/8/22 12/9/22 12/10/22 12/11/22 12/12/22 12/13/22 12/14/22 12/15/22
0 84001001 US USA 840 1001.0 Autauga Alabama US 32.539527 -86.644082 ... 18680 18752 18752 18752 18752 18752 18752 18752 18847 18847
1 84001003 US USA 840 1003.0 Baldwin Alabama US 30.727750 -87.722071 ... 66730 66951 66951 66951 66951 66951 66951 66951 67221 67221
2 84001005 US USA 840 1005.0 Barbour Alabama US 31.868263 -85.387129 ... 6980 6989 6989 6989 6989 6989 6989 6989 7007 7007
3 84001007 US USA 840 1007.0 Bibb Alabama US 32.996421 -87.125115 ... 7637 7653 7653 7653 7653 7653 7653 7653 7668 7668
4 84001009 US USA 840 1009.0 Blount Alabama US 33.982109 -86.567906 ... 17500 17559 17559 17559 17559 17559 17559 17559 17648 17648

5 rows × 1070 columns

We concact the confirmation data for all the counties located in the nine states we selected into nine entries, each one of them have a specific cumulative number of confirmation per day. And eventually, for the sake of better data representation, we transposed the data frame.

In [ ]:
MD = conf[conf["Province_State"] == "Maryland"]
frames = [MD]
confirmed = MD.drop(conf.columns[0:11], axis=1)
confirmed = confirmed.append(confirmed.sum(numeric_only=True), ignore_index=True)
confirmed.drop(confirmed.index[0:26], inplace=True)
list1 = ["Maine", "New York", "Wisconsin", "Kansas", "Alabama", "Texas", "Arizona", "California"]
for x in list1:
  state = conf[conf["Province_State"] == x]
  frames.append(state)
  time = state.drop(state.columns[0:11], axis=1)
  sum = time.append(time.sum(numeric_only=True), ignore_index=True)
  confirmed = confirmed.append(sum.sum(numeric_only=True), ignore_index=True)
confirmed = confirmed.rename(index={0: 'Maryland', 1: 'Maine', 2: 'New York', 3: 'Wisconsin', 4: 'Kansas', 5: 'Alabama', 6: 'Texas', 7: 'Arizona', 8: 'California'})
result = pd.concat(frames)
confirmed = confirmed.swapaxes("index", "columns")
confirmed.index = pd.to_datetime(confirmed.index)
confirmed.head()
Out[ ]:
Maryland Maine New York Wisconsin Kansas Alabama Texas Arizona California
2020-01-22 0 0 0 0 0 0 0 0 0
2020-01-23 0 0 0 0 0 0 0 0 0
2020-01-24 0 0 0 0 0 0 0 0 0
2020-01-25 0 0 0 0 0 0 0 0 0
2020-01-26 0 0 0 0 0 0 0 2 4

We then read in number of deaths by state in the US.

In [ ]:
us_dead = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv", sep=',')
us_dead.head()
Out[ ]:
UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ ... 12/6/22 12/7/22 12/8/22 12/9/22 12/10/22 12/11/22 12/12/22 12/13/22 12/14/22 12/15/22
0 84001001 US USA 840 1001.0 Autauga Alabama US 32.539527 -86.644082 ... 230 230 230 230 230 230 230 230 230 230
1 84001003 US USA 840 1003.0 Baldwin Alabama US 30.727750 -87.722071 ... 716 717 717 717 717 717 717 717 717 717
2 84001005 US USA 840 1005.0 Barbour Alabama US 31.868263 -85.387129 ... 103 103 103 103 103 103 103 103 103 103
3 84001007 US USA 840 1007.0 Bibb Alabama US 32.996421 -87.125115 ... 108 108 108 108 108 108 108 108 108 108
4 84001009 US USA 840 1009.0 Blount Alabama US 33.982109 -86.567906 ... 259 260 260 260 260 260 260 260 260 260

5 rows × 1071 columns

Then we did the same thing to the deaths data for the nine states: we concact the death data for all the counties located in the nine states we selected into nine entries, each one of them have a specific cumulative number of death daily, and we transposed the data frame.

In [ ]:
MD2 = us_dead[us_dead["Province_State"] == "Maryland"]
death = MD2.drop(MD2.columns[0:12], axis=1)
death = death.append(death.sum(numeric_only=True), ignore_index=True)
death.drop(death.index[0:26], inplace=True)
for x in list1:
  state2 = us_dead[us_dead["Province_State"] == x]
  time2 = state2.drop(state2.columns[0:12], axis=1)
  sum2 = time2.append(time2.sum(numeric_only=True), ignore_index=True)
  death = death.append(sum2.sum(numeric_only=True), ignore_index=True)
death = death.rename(index={0: 'Maryland', 1: 'Maine', 2: 'New York', 3: 'Wisconsin', 4: 'Kansas', 5: 'Alabama', 6: 'Texas', 7: 'Arizona', 8: 'California'})
death = death.swapaxes("index", "columns")
death.index = pd.to_datetime(death.index)
death.head()
Out[ ]:
Maryland Maine New York Wisconsin Kansas Alabama Texas Arizona California
2020-01-22 0 0 0 0 0 0 0 0 0
2020-01-23 0 0 0 0 0 0 0 0 0
2020-01-24 0 0 0 0 0 0 0 0 0
2020-01-25 0 0 0 0 0 0 0 0 0
2020-01-26 0 0 0 0 0 0 0 0 0

2. Data representation and analysis

2.1 Overall trend among all regions in the US

2.1.1 Confirmation trend in the US

We first utilized the data from our dataframe us_overall to plot the cumulative growth throught the whole entire timespan of covid.

In [ ]:
us_overall.plot(y="conf_cases", legend=None, figsize=(15,10))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f530bfbe0a0>

Through the plot above, it is reasonable to conclued that the growth rate is coming to a flat point after the sudden surge around June 2022. One possible reason for the sudden surge we suspect is that June and July are the popular months that people would like to go on vacation at. This might increases the chance of contact. Another reason could be that there was a policy change over mask wearing, as a result more and more people stop wearing mask as they used to. Not wearing mask might also be a reason that lead to the increase of infection.

We then utilized the daily increase for covid confirmation to plot a daily increase plot. This plot illustrate the rate of change better than the one we got above.

In [ ]:
us_overall.plot(y="conf_change", legend=None, figsize=(15,10))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f530be76d90>

What we discovered in the first plot is also illustrated by this plot. Indeed, this plot captures more change comparing to the first one because of the nature of derivatives. As we can see, the surge in confirmation numbers is more drastic in this plot.

2.1.2 Deaths trend in the US

We then utilized the data in our us_overall to plot for the cumulative death rate.

In [ ]:
us_overall.plot(y="death_cases", legend=None, figsize=(15,10))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f530b8fe5e0>

There has always been a steady growth in the cumulative death rate just like that of cumulative confirmation rate. However, one noticeble fact we could see from this plot is around the time that a sudden surge occurrs to the covid confirmation number, the total death number did not change much. We suspect that is due to the fact that the sympton of covid is more mild comparing to how it was in the very beginning

Down below is the daily increase plot we plotted with the data from our us_overall data frame.

In [ ]:
us_overall.plot(y="death_change", figsize=(15,10), legend=None)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f530b84e3a0>

This plot could also illustrates the point we claimed above. Death rate through July 2021 to 2022 flunctuates between 1000 to 4000 person per day, but there is no specific pattern that shows any sudden surge around June 2022.

2.2 Trend among all regions in the US

Now lets look at the nine specific states we talked about in the first part.

  • New England: Maine
  • Middle Atlantic: New York
  • East North Central: Wisconsin
  • West North Central: Kansas
  • South Atlantic: Maryland
  • East South Central: Alabama
  • West South Central: Texas
  • Mountain: Arizona
  • Pacific: California
In [ ]:
confirmed.plot(figsize=(15,10))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f530b7ee340>

Cases in nine states continue to trend upward. It is obvious that the number of confirmed cases in June of 2022 increased significantly, which is good because none of the states above showed a tendency that is opposite to that of the overall confirmation tendency. Also, it is worth noticing that more states by the costs have higher cumulative comfirmation rate. This is mainly due to the fact that coast areas have a higher population, and they are more likely to cointain a higher porportion of wai lai ren kou!

In [ ]:
death.plot(figsize=(15,10))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5309ddefa0>

We then did the same thing to the 9 states selected as what we did to the global confirmed and death database-- we calculated the daily increase confirmation rate throught the nine selected states and transposed the dataframe.

In [ ]:
result = result.drop(result.columns[[0,1,2,3,4,7,10]], axis=1)
result = pd.melt(result, ['Admin2','Province_State', 'Lat', 'Long_'], var_name="Date", value_name='Cases')
result = result.drop(columns=['Province_State'])
result = result.rename(columns={'Admin2': 'Admin', 'Long_': 'Long'})
result["Date"] = pd.to_datetime(result['Date'])
result = result.groupby(['Admin', 'Date']).sum()
result["Next_day"] = result['Cases'].shift(fill_value=0)
result["Daily_change"]= result['Cases'] - result['Next_day']
result = result.drop(columns=['Next_day'])
result = result.reset_index()
result = result[result["Daily_change"] >= 0]

To illistrate the change better, we put them into the US map and represent the daily increase as a heat map. Below is the code we used to generate the heat map.

In [ ]:
result["Date"] = result["Date"].astype(str)
fig = px.scatter_geo(result, lat="Lat", lon="Long",
                     hover_name="Admin", size="Daily_change",size_max=80,
                     animation_frame="Date",
                     scope = "usa",
                     title = "Total Cases")
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 50
fig.show()

3. Hypothesis testing

As what we talked about how the growth of covid confirmation rate is disorportional to that of death rate. We suspect that the covid symptons nowadays is getting more mild. Therefore, we would want to test whether the mortality rate is lower than how it was before.

The mid point of covid is roughly in the middle of June. So we would set that as a middle point and conduct hypothesis testing for the data before the middle of June 2021 and that after.

$H_0$: the mean proportion of death before June 2021 is equal to that after June 2021

$H_\alpha$: the mean proportion of death before June 2021 is greater than that after June 2021

Here is a visualization of the covid mortality rate. There seems to be a tendency that slowly went down little by little.

In [ ]:
us_overall['proportion'] = us_overall.apply(lambda row: row['death_change'] / row['conf_change'], axis=1)
us_overall.plot(y= 'proportion', legend=None, figsize=(15,10))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f530669a3a0>

After analyzing the motality rate data, we did some data cleaning. We first got rid of all the NaN values, then we split the dataframe in half.

In [ ]:
us_overall = us_overall.dropna()
first = us_overall.head(515)
last = us_overall.tail(515)

We then did a paired t-test on both sets of data as shown below.

In [ ]:
stats.ttest_rel(first["proportion"], last["proportion"])
Out[ ]:
Ttest_relResult(statistic=14.502805772619178, pvalue=3.3602135329649983e-40)

The result p-value for the paired t-test is roughly 3.36e-40, which is significantly less than the alpha value of 0.05. Therefore we could reject the null hypothesis that the mean proportion of death before June 2021 is equal to that after June 2021. So then we could conclude that there might be a decrease in the covid mortality rate.


Part 2: COVID and housing price

1. Data Collection

In this part of the project, we want to see the correlation between housing price and covid confirmation rate. Since we already have the covid dataset ready, in this part we would want to ellabrate more on the housing data that we obtained.

We gathered our housing data from https://www.fhfa.gov/DataTools/Downloads/Pages/House-Price-Index.aspx. In this dataset we have all nine reigon in the US as columns and monthly HPI on each row. Below is some background regarding the HPI.

The FHFA House Price Index (FHFA HPI®) is the nation’s only collection of public, freely available house price indexes that measure changes in single-family home values based on data from all 50 states and over 400 American cities that extend back to the mid-1970s. The FHFA HPI incorporates tens of millions of home sales and offers insights about house price fluctuations at the national, census division, state, metro area, county, ZIP code, and census tract levels. FHFA uses a fully transparent methodology based upon a weighted, repeat-sales statistical technique to analyze house price transaction data.

In [ ]:
housing = pd.read_csv("/content/HPI_PO_monthly_hist.csv")
housing.index = pd.to_datetime(housing['Month'])
housing = housing.drop(columns=["Month"])
housing = housing.drop(housing.index[0])
housing.head()
Out[ ]:
East North Central East South Central Middle Atlantic Mountain New England Pacific South Atlantic West North Central West South Central USA
Month
2020-02-01 232.67 260.54 249.97 392.87 264.70 327.64 289.09 278.23 291.92 282.50
2020-03-01 235.97 262.93 251.29 397.67 268.38 331.57 290.31 281.34 294.82 285.24
2020-04-01 238.33 265.19 253.43 399.73 270.87 334.22 292.29 284.73 296.96 287.59
2020-05-01 239.57 266.47 255.06 399.92 269.87 332.23 293.87 285.82 297.94 288.36
2020-06-01 243.02 269.12 256.54 404.90 274.06 336.72 296.63 289.66 303.16 291.92

2. Data management/representation + Exploratory data analysis

To better illustrate the tendency of housing price, we plotted a month vs HPI plot.

In [ ]:
housing.plot(y="USA",legend=None,  figsize=(15,10))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f530bfe8f70>

The housing price for the US as a whole have been steadily rising till recently. It somehow started to drop by around May 2022. Because it was steadily increasing along with the covid confirmation number, we suspect that there would be a positive correlation between the two variables.

We then plot the month vs HPI plot for all nine reigons. We also left the line chart in the plot for the US as well for reference.

In [ ]:
housing.plot(figsize=(15,10))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5305dd0fa0>

All different areas also have the same tendency, which give us more confidence on creating multiple regression models with good fit for these states.

3. Linear regression analysis & Hypothesis testing

Here is a reference on what reigon and state we are doing regression on:

  • New England: Maine
  • Middle Atlantic: New York
  • East North Central: Wisconsin
  • West North Central: Kansas
  • South Atlantic: Maryland
  • East South Central: Alabama
  • West South Central: Texas
  • Mountain: Arizona
  • Pacific: California
In [ ]:
us_overall = us_conf.join(us_death, how='outer')

Here we integrated the three data frames so that all we could have the data aligned for the same date

In [ ]:
first = confirmed[confirmed.index.day == 1]
first = first.drop(first.index[[33, 34, 32]])
first2 = us_overall[us_overall.index.day == 1]
first2 = first2.drop(first2.index[[34,33,32]])
hypo = pd.concat([housing, first], axis=1)
hypo = pd.concat([hypo, first2], axis=1)
hypo = hypo.rename(columns={'East North Central': 'East_North','East South Central': 'East_South','Middle Atlantic': 'Middle_Atlantic','New England': 'New_England','South Atlantic': 'South_Atlantic', 'West North Central': 'West_North', 'West South Central': 'West_South', 'New York': 'New_York'})
hypo.head()
Out[ ]:
East_North East_South Middle_Atlantic Mountain New_England Pacific South_Atlantic West_North West_South USA ... Wisconsin Kansas Alabama Texas Arizona California conf_cases conf_change death_cases death_change
2020-02-01 232.67 260.54 249.97 392.87 264.70 327.64 289.09 278.23 291.92 282.50 ... 0 0 0 0 2 6 8.0 0.0 0.0 0.0
2020-03-01 235.97 262.93 251.29 397.67 268.38 331.57 290.31 281.34 294.82 285.24 ... 0 0 0 0 2 38 32.0 7.0 1.0 0.0
2020-04-01 238.33 265.19 253.43 399.73 270.87 334.22 292.29 284.73 296.96 287.59 ... 3112 970 2350 7046 2826 18530 227893.0 35814.0 6996.0 1637.0
2020-05-01 239.57 266.47 255.06 399.92 269.87 332.23 293.87 285.82 297.94 288.36 ... 14628 9268 14880 49248 15938 105092 1115845.0 34846.0 68515.0 1880.0
2020-06-01 243.02 269.12 256.54 404.90 274.06 336.72 296.63 289.66 303.16 291.92 ... 37086 19840 37050 108604 40258 232608 1808724.0 17853.0 108613.0 767.0

5 rows × 23 columns

We then plot the regression plot for the covid confirmation cases and HPI in the US. We then extract the correlation determination to showcase the how good our line fits the data points. In the end, to test on the sigificance of the regression result, we ran a regression test.

In [ ]:
a = np.array(first2["conf_cases"].values.reshape(-1, 1))
b = np.array(hypo["USA"].values)
m = linear_model.LinearRegression().fit(a,b)

plt.plot(a,b,'k.')
plt.plot(a, m.predict(a))
plt.xlabel("confirmation")
plt.ylabel("HPI")
plt.show()

print(("Coefficient of determination for covid confirmation in USA and housing price in USA"))
print(str(m.score(a,b)))

result = o(formula = "conf_cases ~ USA", data = hypo).fit()
print (result.summary())
Coefficient of determination for covid confirmation in USA and housing price in USA
0.9542227103742192
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             conf_cases   R-squared:                       0.954
Model:                            OLS   Adj. R-squared:                  0.953
Method:                 Least Squares   F-statistic:                     625.3
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           1.20e-21
Time:                        22:16:58   Log-Likelihood:                -548.38
No. Observations:                  32   AIC:                             1101.
Df Residuals:                      30   BIC:                             1104.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -2.214e+08   1.04e+07    -21.246      0.000   -2.43e+08      -2e+08
USA         7.575e+05   3.03e+04     25.007      0.000    6.96e+05    8.19e+05
==============================================================================
Omnibus:                        0.225   Durbin-Watson:                   0.330
Prob(Omnibus):                  0.893   Jarque-Bera (JB):                0.419
Skew:                           0.117   Prob(JB):                        0.811
Kurtosis:                       2.491   Cond. No.                     2.93e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.93e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

To our surprise, the line fits the datapoint realy well. As we can see, the $R^2$ value is 0.954. So, 95.4% of the HPI can be explaind by the number of covid comfirmed. In addition to that the p-value for the f-statistic in rather low (lower than default $\alpha$ 0.05) which indecates that we would be rejecting the null hypothesis $\beta_0 = 0$. This further indecates that regression relation exists.

For the nine reigons specifically, we run everything we did for the US as discribed above in a loop for nine times.

In [ ]:
hypo_state = ["Maine", "New_York", "Wisconsin", "Kansas", "Maryland", "Alabama", "Texas", "Arizona", "California"]
hpi = ["New_England", "Middle_Atlantic", "East_North", "West_North", "South_Atlantic", "East_South", "West_South", "Mountain", "Pacific"]
for s, h in zip(hypo_state, hpi):
  a = np.array(hypo[s].values.reshape(-1, 1))
  b = np.array(hypo[h].values)
  m = linear_model.LinearRegression().fit(a,b)

  print ("Regression Plot for "+ s + " and " + h +":")
  plt.plot(a,b,'k.')
  plt.plot(a, m.predict(a))
  plt.xlabel("confirmation")
  plt.ylabel("HPI")
  plt.show()

  print(("\n\nCoefficient of determination for covid confirmation in " + s + " and housing price in " + h +":"))
  print(str(m.score(a,b)))
  formula = s + " ~ " + h
  result = o(formula = formula, data = hypo).fit()
  print (result.summary())
  print ("*********************************************************************************************")
  print()
Regression Plot for Maine and New_England:

Coefficient of determination for covid confirmation in Maine and housing price in New_England:
0.864910943785443
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Maine   R-squared:                       0.865
Model:                            OLS   Adj. R-squared:                  0.860
Method:                 Least Squares   F-statistic:                     192.1
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           1.41e-14
Time:                        22:16:58   Log-Likelihood:                -403.20
No. Observations:                  32   AIC:                             810.4
Df Residuals:                      30   BIC:                             813.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept   -1.361e+06   1.12e+05    -12.105      0.000   -1.59e+06   -1.13e+06
New_England  4775.6277    344.583     13.859      0.000    4071.895    5479.361
==============================================================================
Omnibus:                        2.413   Durbin-Watson:                   0.120
Prob(Omnibus):                  0.299   Jarque-Bera (JB):                1.295
Skew:                           0.100   Prob(JB):                        0.523
Kurtosis:                       2.035   Cond. No.                     2.80e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.8e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
*********************************************************************************************

Regression Plot for New_York and Middle_Atlantic:

Coefficient of determination for covid confirmation in New_York and housing price in Middle_Atlantic:
0.9242582020415863
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               New_York   R-squared:                       0.924
Model:                            OLS   Adj. R-squared:                  0.922
Method:                 Least Squares   F-statistic:                     366.1
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           2.32e-18
Time:                        22:16:58   Log-Likelihood:                -490.04
No. Observations:                  32   AIC:                             984.1
Df Residuals:                      30   BIC:                             987.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept       -3.198e+07   1.92e+06    -16.616      0.000   -3.59e+07    -2.8e+07
Middle_Atlantic  1.232e+05   6440.837     19.133      0.000     1.1e+05    1.36e+05
==============================================================================
Omnibus:                        4.245   Durbin-Watson:                   0.254
Prob(Omnibus):                  0.120   Jarque-Bera (JB):                1.712
Skew:                           0.125   Prob(JB):                        0.425
Kurtosis:                       1.895   Cond. No.                     2.91e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.91e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
*********************************************************************************************

Regression Plot for Wisconsin and East_North:

Coefficient of determination for covid confirmation in Wisconsin and housing price in East_North:
0.9380086401061278
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              Wisconsin   R-squared:                       0.938
Model:                            OLS   Adj. R-squared:                  0.936
Method:                 Least Squares   F-statistic:                     453.9
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           1.14e-19
Time:                        22:16:59   Log-Likelihood:                -449.63
No. Observations:                  32   AIC:                             903.3
Df Residuals:                      30   BIC:                             906.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -1.018e+07   5.51e+05    -18.484      0.000   -1.13e+07   -9.06e+06
East_North  4.211e+04   1976.587     21.306      0.000    3.81e+04    4.61e+04
==============================================================================
Omnibus:                        3.271   Durbin-Watson:                   0.340
Prob(Omnibus):                  0.195   Jarque-Bera (JB):                1.482
Skew:                          -0.080   Prob(JB):                        0.477
Kurtosis:                       1.958   Cond. No.                     2.75e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.75e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
*********************************************************************************************

Regression Plot for Kansas and West_North:

Coefficient of determination for covid confirmation in Kansas and housing price in West_North:
0.9462160951861599
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 Kansas   R-squared:                       0.946
Model:                            OLS   Adj. R-squared:                  0.944
Method:                 Least Squares   F-statistic:                     527.8
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           1.35e-20
Time:                        22:16:59   Log-Likelihood:                -423.73
No. Observations:                  32   AIC:                             851.5
Df Residuals:                      30   BIC:                             854.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -5.082e+06   2.53e+05    -20.061      0.000    -5.6e+06   -4.57e+06
West_North  1.762e+04    766.980     22.974      0.000    1.61e+04    1.92e+04
==============================================================================
Omnibus:                        1.835   Durbin-Watson:                   0.376
Prob(Omnibus):                  0.400   Jarque-Bera (JB):                1.104
Skew:                          -0.013   Prob(JB):                        0.576
Kurtosis:                       2.091   Cond. No.                     3.36e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.36e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
*********************************************************************************************

Regression Plot for Maryland and South_Atlantic:

Coefficient of determination for covid confirmation in Maryland and housing price in South_Atlantic:
0.966012441850216
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               Maryland   R-squared:                       0.966
Model:                            OLS   Adj. R-squared:                  0.965
Method:                 Least Squares   F-statistic:                     852.7
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           1.37e-23
Time:                        22:16:59   Log-Likelihood:                -403.61
No. Observations:                  32   AIC:                             811.2
Df Residuals:                      30   BIC:                             814.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept      -2.333e+06   9.76e+04    -23.911      0.000   -2.53e+06   -2.13e+06
South_Atlantic  7964.2930    272.744     29.201      0.000    7407.275    8521.311
==============================================================================
Omnibus:                        0.055   Durbin-Watson:                   0.410
Prob(Omnibus):                  0.973   Jarque-Bera (JB):                0.105
Skew:                          -0.071   Prob(JB):                        0.949
Kurtosis:                       2.759   Cond. No.                     2.63e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.63e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
*********************************************************************************************

Regression Plot for Alabama and East_South:

Coefficient of determination for covid confirmation in Alabama and housing price in East_South:
0.9716002661382498
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Alabama   R-squared:                       0.972
Model:                            OLS   Adj. R-squared:                  0.971
Method:                 Least Squares   F-statistic:                     1026.
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           9.23e-25
Time:                        22:16:59   Log-Likelihood:                -430.10
No. Observations:                  32   AIC:                             864.2
Df Residuals:                      30   BIC:                             867.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -6.662e+06   2.48e+05    -26.835      0.000   -7.17e+06   -6.16e+06
East_South  2.498e+04    779.761     32.037      0.000    2.34e+04    2.66e+04
==============================================================================
Omnibus:                        0.551   Durbin-Watson:                   0.609
Prob(Omnibus):                  0.759   Jarque-Bera (JB):                0.623
Skew:                           0.011   Prob(JB):                        0.732
Kurtosis:                       2.317   Cond. No.                     2.60e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.6e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
*********************************************************************************************

Regression Plot for Texas and West_South:

Coefficient of determination for covid confirmation in Texas and housing price in West_South:
0.9699563858041998
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Texas   R-squared:                       0.970
Model:                            OLS   Adj. R-squared:                  0.969
Method:                 Least Squares   F-statistic:                     968.5
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           2.15e-24
Time:                        22:17:00   Log-Likelihood:                -484.08
No. Observations:                  32   AIC:                             972.2
Df Residuals:                      30   BIC:                             975.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -3.741e+07   1.42e+06    -26.361      0.000   -4.03e+07   -3.45e+07
West_South  1.256e+05   4036.038     31.122      0.000    1.17e+05    1.34e+05
==============================================================================
Omnibus:                        1.697   Durbin-Watson:                   0.574
Prob(Omnibus):                  0.428   Jarque-Bera (JB):                1.426
Skew:                           0.359   Prob(JB):                        0.490
Kurtosis:                       2.255   Cond. No.                     3.04e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.04e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
*********************************************************************************************

Regression Plot for Arizona and Mountain:

Coefficient of determination for covid confirmation in Arizona and housing price in Mountain:
0.9475860252183089
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Arizona   R-squared:                       0.948
Model:                            OLS   Adj. R-squared:                  0.946
Method:                 Least Squares   F-statistic:                     542.4
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           9.17e-21
Time:                        22:17:00   Log-Likelihood:                -453.70
No. Observations:                  32   AIC:                             911.4
Df Residuals:                      30   BIC:                             914.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   -8.45e+06   4.49e+05    -18.835      0.000   -9.37e+06   -7.53e+06
Mountain    2.093e+04    898.667     23.289      0.000    1.91e+04    2.28e+04
==============================================================================
Omnibus:                        1.773   Durbin-Watson:                   0.326
Prob(Omnibus):                  0.412   Jarque-Bera (JB):                0.919
Skew:                           0.394   Prob(JB):                        0.632
Kurtosis:                       3.260   Cond. No.                     3.53e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.53e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
*********************************************************************************************

Regression Plot for California and Pacific:

Coefficient of determination for covid confirmation in California and housing price in Pacific:
0.8944032883329911
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             California   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.891
Method:                 Least Squares   F-statistic:                     254.1
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           3.44e-16
Time:                        22:17:00   Log-Likelihood:                -514.64
No. Observations:                  32   AIC:                             1033.
Df Residuals:                      30   BIC:                             1036.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -4.861e+07   3.61e+06    -13.477      0.000    -5.6e+07   -4.12e+07
Pacific     1.432e+05   8981.744     15.941      0.000    1.25e+05    1.62e+05
==============================================================================
Omnibus:                        4.790   Durbin-Watson:                   0.246
Prob(Omnibus):                  0.091   Jarque-Bera (JB):                3.233
Skew:                           0.697   Prob(JB):                        0.199
Kurtosis:                       3.694   Cond. No.                     3.40e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.4e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
*********************************************************************************************

Baring the result of the regression relation of the US as a whole in mind, all the results from the regression relation and test above are somewhat expected. We have half of the data with $R^2$ value greater than that of the over all, half of them less than that of the US.

This result is also somehow unexpected because we only selected 9 representative states from all 50 states. So the data might be somewhat biased. However, the result shows stronger correlation than what we were expecting.

looking at the results we generated, we can see that the coefficient for 9 states are all greater than 0.7, which tell us that the change in House Price Index (HPI) and number of confirmed in each state has strong relationship.

All the p-values that we obtained for the f-statistics are pretty low, which indecates that we would be rejecting the null hypothesis $\beta_0 = 0$. This further indecates that regression relation exists.

5. Insights attained

Motivation: each tutorial should be sufficiently motivated. If there is not motivation for the analysis, why would we ‘do data science’ on this topic?

Understanding: the reader of the tutorial should walk away with some new understanding of the topic at hand. If it’s not possible for a reader to state ‘what they learned’ from reading your tutorial, then why do the analysis?

Resources: tutorials should help the reader learn a skill, but they should also provide a launching pad for the reader to further develop that skill. The tutorial should link to additional resources wherever appropriate, so that a well-motivated reader can read further on techniques that have been used in the tutorial.

Prose: it’s very easy to write the literal English for what the Python code is doing, but that’s not very useful. The prose should enhance, the tutorial, adding additional context and insight.

Code: code should be clear and commented. Function definitions should be described and given context/motivation. If the prose helps the reader understand why you’ve written the code, the comments in the code should be sufficient for the reader to learn how.

Pipeline: all stages of the pipeline should be discussed. We will be looking for ‘good science’, with discussion of each stage and what it’s implications/consequences are.

Communication of Approach: every technical choice has alternatives, why did you choose the approach taken in the tutorial? A reader should walk away with some idea of what the trade-offs may be.

Formatting and Subjective Evaluation: does the tutorial seem polished and ‘publishable’, or haphazard and quickly thrown together? The tutorials should read as well put-together and having undergone a few iterations of editing and refinement. This should be the easiest of the dimensions.